xxxxxxxxxx# College dataset__Description__ Statistics for a large number of US Colleges from the 1995 issue of US News and World Report. Dimensions : 777 x 18 [Short description of variables (appendix)](#Short-description-of-variables)__Sources__ This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the ASA Statistical Graphics Section's 1995 Data Analysis Exposition.__References__ This dataset is a part of the course material of the [book](https://www.statlearning.com/) : ___Introduction to Statistical Learning with R___ (Ch 02 - Statistical Learning - Applied Exercises - Problem 8)Description
Statistics for a large number of US Colleges from the 1995 issue of US News and World Report.
Dimensions : 777 x 18
Short description of variables (appendix)
Sources
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the ASA Statistical Graphics Section's 1995 Data Analysis Exposition.
References
This dataset is a part of the course material of the book : Introduction to Statistical Learning with R
(Ch 02 - Statistical Learning - Applied Exercises - Problem 8)
xxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx### Index- [Short description of variables (appendix)](#Short-description-of-variables)- [2.8a - Import data](#2.8a---Import-data) - [Preliminary_Observations](#Preliminary_Observations)- [2.8b - Data preparation](#2.8b---Data-preparation)- [2.8c - Data exploration](#2.8c---Data-exploration) - [2.8c.1 - Summary statistics](#2.8c.1---Summary-statistics) - [2.8c.2 - Scatterplot matrix](#2.8c.2---Scatterplot-matrix) - [Observations - Pairplots](#Observations_Pairplots) - [Correlation Matrix](#Correlation) - [2.8c.3 - Boxplot](#2.8c.3---Boxplot) - [Observations - Outstate v Private](#Observations_-_Outstate_~_Private) - [2.8c.4 - Elite](#2.8c.4---Elite) - [Observations](#Observations_-_Elite) - [2.8c.5 - Histograms](#2.8c.5---Histograms) - [a) Student expenditure related variables](#a%29-Student-expenditure-related-variables) - [Observations](#Observations_-_Student) - [b) Faculty and student related ratios](#b%29-Faculty-and-student-related-ratios) - [Observations](#Observations_-_Faculty_Student_Ratios) - [2.8c.6 - Further data exploration](#2.8c.6---Further-data-exploration) - [a) Spending patterns - private vs non-private](#a%29-Spending-patterns---private-vs-non-private) - [Observations](#Observations_-_Student_spending) - [b) Most sought after college/university](#b%29-Most-sought-after-college/university) - [Most Sought-after Colleges/Universities (Final list)](#Most-Sought-after-Colleges/Universities-(Final-list%29) - [c) Further analysis of Most sought-after colleges/univeristies](#c%29-Further-analysis-of-Most-sought-after-colleges/univeristies) - [Observations](#Observations_-_MSA) - [d) Top 20 colleges by applications](#d%29-Top-20-colleges-by-applications) - [Observations](#Observations_-_Top_Apps) - [e) Further analysis of Elite colleges](#e%29-Further-analysis-of-Elite-colleges) - [Observations](#Observations_-_Top_Elite)- [Code help sources](#Code-help-sources)xxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx# save default options and parametersdefop = options()defpar = par(no.readonly=T)# function to modify plot parametersplot_pars = function(w=7,h=7) {options(repr.plot.width=w, repr.plot.height=h)}xxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx# a) import datafpath = file.path(getwd(),'datasets','College.csv')college = read.csv(fpath)dim(college)head(college)xxxxxxxxxx## alternatives## 1# setwd("E:/Data Science/Statistics/Intro to Statistical Learning with R")# df = read.csv("./datasets/College.csv")## 2# download.file("https://statlearning.com/College.csv", destfile="College.csv", mode="wb")# mode='wb' >> for windows# df = read.csv("College.csv")# head(df)xxxxxxxxxx# missing values in datasum(is.na(college))xxxxxxxxxx# columns with missing valuesnames(college)[colSums(is.na(college)) > 0]xxxxxxxxxx# rows with missing valuesrownames(college)[apply(college, 1, function(x) sum(is.na(x))) > 0]xxxxxxxxxxstr(college)xxxxxxxxxx<div class="alert alert-block alert-info"><a id='Preliminary_Observations'></a><b>Preliminary observations:</b><br> - No missing values.<br> - Currently, college/universities' names form part of the dataset. They will be added as rownames and removed from the executable data.<br> - Categorical variable 'Private' is presently saved as character. It will be converted to factor.</div>xxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx# Create rownames from college namesrownames(college) = college[,1]# Remove column containing namescollege = college[,-1]# Reclassify 'Private' as factorcollege$Private = as.factor(college$Private)# Confirm changescollege[sample.int(nrow(college), 3), ]xxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx# Summary of variables : # numerical >> descriptive statistics# categorical >> frequency table# character >> countsummary(college)xxxxxxxxxx# factor columnsfcts = Filter(f = is.factor, college)# show levelssapply(fcts, levels)# alternatively# sapply(college[sapply(college, is.factor)], levels)xxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx# Select variables for pairs plotselected_vars = setdiff(names(college), c('F.Undergrad','P.Undergrad','Books', 'S.F.Ratio','perc.alumni'))length(selected_vars)selected_varsxxxxxxxxxx# increase plot sizeplot_pars(17,17)# scatterplot matrixpairs(college[,selected_vars], pch = 19, cex = 0.5, lower.panel=, cex.labels=1.8)# reset plot paramsoptions(defop)xxxxxxxxxx<div class="alert alert-block alert-info"><a id='Observations_Pairplots'></a><b>Tentative observations:</b><br> - Outstate fees for private colleges has a higher spread and overall magnitude as compared to non-private.<br>- Private colleges have a much higher 'Expend' (instructional expenditure per student)<br>- There is moderately positive relationship bet the colleges preferred by the Top10perc and the Outstate tuition charged<br></div>xxxxxxxxxx##### #'#####################################################################xxxxxxxxxx# High correlation relations# Correlation of numeric columnscorr = cor(college[sapply(college, is.numeric)])# Display |correlations| > 0.55as.data.frame(apply(corr, 2, function(x) ifelse (abs(x) >=0.55, round(x,3), "-")))xxxxxxxxxx##### #'#####################################################################xxxxxxxxxx# Correlation plotsuppressPackageStartupMessages(library(corrplot))plot_pars(9,8)corrplot(cor(corr), method='number', tl.cex=1, number.cex=1)options(defop)xxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxxlibrary(ggplot2)xxxxxxxxxx# Plot parametersplot_pars(10,4)# Outstate vs Private boxplotggplot(college, aes(y=Private, x=Outstate, fill=Private)) + geom_boxplot() +labs(title='Outstate distribution in Private and Non-private') +theme_bw() +theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.text=element_text(size=14, face="bold"), axis.title.x = element_text(size = 16), axis.text.x = element_text(size=15, angle=45, hjust=1))# size=rel(3.5) to change all# reset plot dimensionsoptions(defop)xxxxxxxxxx<div class="alert alert-block alert-info"><a id='Observations_-_Outstate_~_Private'></a><b>Observations:</b><br>- As we had seen in the pairs plot, out-of-state tuition charged by the private colleges is much more.</div>xxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx### 2.8c.4 - EliteElite >> universities with Top10perc > 50%Elite >> universities with Top10perc > 50%
xxxxxxxxxx# Glance at Top10percplot.new()par(mar=c(2,2,1,1))par(oma=c(1,2,3,1))plot_pars(17,5)par(fig=c(0,0.35, 0,1), new=TRUE)hist(college$Top10perc, nclass='FD', col='thistle', cex.main=1.5, main='Histogram')par(fig=c(0.35,1, 0,1), new=TRUE)boxplot(Top10perc ~ Private, data=college, horizontal=T, col=5:6, boxwex=0.3, frame=F, main="Top10perc in Private v Non-Private", cex.main=1.5, ylab="Private", names=c('Private','Non-Private'), las=1, xlab="Top10perc Distribution")# mtext("Top10perc", side=3, adj=0.40, cex=1.8, line=0, outer=TRUE)title('Top10perc', adj=0.4, cex.main=2, line=1, outer=T)options(defop)xxxxxxxxxx# Create new variable 'Elite'college$Elite = as.factor(ifelse(college$Top10perc > 50, 'Yes', 'No'))summary(college$Elite)xxxxxxxxxx# Elite and privatesum(college$Elite=="Yes")sum(college$Elite=='Yes' & college$Private=='Yes')xxxxxxxxxxcollege[sample.int(nrow(college), 3), ]xxxxxxxxxx# Outstate Tuition - Eliteplot_pars(15,4)boxplot(Outstate ~ Elite, data=college, horizontal=T, col=5:6, frame=F, main="Outstate Tuition in Elite v Non-elite", boxwex=0.5, las=1, ylab="Elite", xlab="Outstate tuition (USD)")options(defop)xxxxxxxxxx<div class="alert alert-block alert-info"><a id='Observations_-_Elite'></a><b>Observations:</b><br>- 83% (65 of 78) of the 'Elite' institutions are private.<br>- The distribution of Outstate tuition in Elite universities is heavily right-skewed indicating that most of the 'Elite' institutions charge high out-of-state tuition.<br>- The median Outstate tuition in Elite institutions is much higher than in Non-elite instituition, pointing to a clear difference between the educational accessibility for out-of-state students.</div>xxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx### 2.8c.5 - HistogramsFreedman-Diaconis method has been used for calculating bin-widths.Freedman-Diaconis method has been used for calculating bin-widths.
xxxxxxxxxx###### a) Student expenditure related variablesxxxxxxxxxx# Total student expenditure (with and without Outstate tuition)# Student exp = Room.Board + Books + Personal + Outstatecollege$TExp.without.O = college$Room.Board + college$Books + college$Personalcollege$TExp.with.O = college$Outstate + college$Room.Board + college$Books + college$Personalxxxxxxxxxx"a) Student expenditure related variables"# selected variablesvars = c(unlist(strsplit('Outstate, Room.Board, Books, Personal, TExp.without.O, TExp.with.O, Expend', split=', ')))xlabels = varsylabels = rep('Freq', 7)# 2 by 3 gridpar(mfrow = c(3,3))par(mar=c(5.1, 4.1, 4.1+2, 2.1)) # b,l,t,rpar(oma=c(0,0+2,0,0))# resize plot paramsplot_pars(17,15)# function to generate histfor (i in seq(length(vars))) { var = vars[i] hist(college[, var], nclass='FD', col=c(i+1), main=paste(var), xlab=xlabels[i], ylab=ylabels[i], cex.main=2, cex.lab=1.7, cex.axis=1.5) median.var = median(college[, var]) mtext(paste('median : ', median.var), 3, adj = 0.5, line = 0.5, cex=1.2, col=) abline(v=median.var, lty=2, lwd=2)}par(defpar)options(defop)xxxxxxxxxx<div class="alert alert-block alert-info"><a id='Observations_-_Student'></a><b>Observations:</b><br>- Out-of-state tuition and Room.Board expenses are slightly positively skewed.<br>- Expenditure on 'Books', 'Personal' expenses of students and 'Expend' (instructional exp per student) are positively skewed.<br> - Median Total Expenditure with Outstate tuition is \$16,079. <br>  Median household income in the same year (1995) as per the <a href="https://www.census.gov/library/publications/1996/demo/p60-193.html">US Census</a> was ≈ $34,000.</div>xxxxxxxxxx# boxplots - Student expenses with and without Outstate tuitionplot_pars(17,4)par(oma=c(0,0+1,0,0))boxplot(college[tail(names(college), 2)], horizontal=T, col=5:6, main='Student expenses with and without Outstate tuition', xlab='Total Student Exp (Room.Board + Book + Personal)', ylab='Outstate Tuition', names=c('without','with'), las=0, boxwex=0.5, cex.lab=1.5, cex.axis=1.3, cex.main=1.7, frame=F)par(defpar)options(defop)xxxxxxxxxx# Skewness and KurtosissuppressPackageStartupMessages(library(psych))t(round(describe(college[vars])[,c('skew','kurtosis')], 4))xxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx"b) Faculty and student related ratios"# selected variablesvars = c(unlist(strsplit('PhD Terminal S.F.Ratio perc.alumni Grad.Rate', split=' ')))xlabels = varspar(mfrow=c(2,3))par(mar=c(5.1, 4.1, 4.1+2, 2.1)) # b,l,t,rpar(oma=c(0,0+2,0,0))plot_pars(15,10)for (i in seq(length(vars))) { var = vars[i] hist(college[,var], nclass='FD', col=i+1, main=var, xlab=xlabels[i], ylab='Freq', cex.main=2, cex.lab=1.6, cex.axis=1.5) median.var = median(college[, var]) mtext(paste('median : ', median.var), 3, adj = 0.5, line = 0.5, cex=1.2) abline(v=median.var, lty=2, lwd=2)}par(defpar)options(defop)xxxxxxxxxx<div class="alert alert-block alert-info"><a id='Observations_-_Faculty_Student_Ratios'></a><b>Observations:</b><br>- PhD and Terminal are hevily left-skewed, i.e. most of the faculty is highly specialised in their respective disciplines.<br>   PhD has one bin > 100. This could be a mistake.<br>- Not many colleges have student-faculty ratio > 20.<br>- There is wide fluctuation in Graduation rate with 17.63% of institutions having graduation rates below 50%.<br> - IQR (Q3-Q1) of alumnis who donate ranges from 13% to 31%.<br> **Note**: see workings below for calculations<br> </div>Note:
see workings below for calculations
xxxxxxxxxx# Institutions with PhD > 100cbind(rownames(college)[which(college$PhD > 100)], college[which(college$PhD > 100), 'PhD'])xxxxxxxxxx## Graduation ratesummary(college$Grad.Rate)## No. of Istitutions with graduation rate < 50%length(which(college$Grad.Rate < 50))xxxxxxxxxx## % of Istitutions with graduation rate < 50%round(137/nrow(college)*100, 2)# alternatively# percentile = ecdf(college$Grad.Rate)# percentile(49.99)*100xxxxxxxxxx# Donor alumnissummary(college$perc.alumni)xxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx##### a) Spending patterns - private vs non-privatexxxxxxxxxx# Proportion of Privatea = data.frame(table(college$Private))names(a) = c('Private', 'Count')a$Prop = round(prop.table(table(college$Private)), 3)a$Median.Personal = tapply(college$Personal, college$Private, median)a[,'>1649'] = tapply(college$Personal, college$Private, function(x) sum(x > 1649))axxxxxxxxxx# Histogramspar(mfrow=c(1,2))par(oma=c(0,0+1,0,0))plot_pars(10,4)hist(college$Personal[college$Private=='Yes'], col=5, cex.main=1.5, cex.lab=1.3, cex.axis=1.2, las=1, main='Personal exp in Private', xlab='Personal Exp', ylab='freq', nclass='FD')hist(college$Personal[college$Private!='Yes'], col=6, cex.main=1.5, cex.lab=1.3, cex.axis=1.2, las=1, main='Personal exp in Non-Private', xlab='Personal Exp', ylab='freq', nclass='FD')par(defpar)options(defop)xxxxxxxxxx# Boxplot - Student personal exp in Private v Non-privatepar(oma=c(0,0+1,0,0))plot_pars(10,4)boxplot(Personal ~ Private, data=college, horizontal=T, col=5:6, main='Personal exp in Private v Non-private', las=1, cex.main=1.7, cex.lab=1.5, cex.axis=1.2, frame=F)par(defpar)options(defop)xxxxxxxxxx<div class="alert alert-block alert-info"><a id='Observations_-_Student_spending'></a><b>Observations:</b><br>- Distribution in private is highly positively skewed while distribution in non-private is moderately positively skewed.<br> - Median personal spending by students in non-private (\$1649) is higher than in private (\$1100).<br>- The number of institutions where Personal spending is > \$1649 (median(non-private)) is almost similar for both private (103) and non-private (106).</div>xxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx##### b) Most sought after college/universityInstitutions with high - Top10perc- Top25perc- Apps- No. of Apps per Enroll- No. of Enroll per AcceptInstitutions with high
xxxxxxxxxx## Selected factorsvars = c(unlist(strsplit('Apps, Accept, Enroll, Top10perc, Top25perc, F.Undergrad', split=', ')))# Creating subset of dataframeadmission.data = college[,vars]## Calculating ratios# Application ratioadmission.data$apr = round(admission.data$Apps / admission.data$Enroll * 100)# Acceptance rateadmission.data$acr = round(admission.data$Accept / admission.data$Apps * 100)# Enrollment rateadmission.data$enr = round(admission.data$Enroll / admission.data$Accept * 100)str(admission.data)xxxxxxxxxx# Top 100 colleges by Top10perctop100_Top10perc = admission.data[order(admission.data$Top10perc, decreasing=T),][1:100,]dim(top100_Top10perc)head(top100_Top10perc,10)xxxxxxxxxx# Top 100 colleges by Top25perctop100_Top25perc = admission.data[order(admission.data$Top25perc, admission.data$Top10perc, decreasing=c(T,T)), ][1:100, ]head(top100_Top25perc,10)xxxxxxxxxx# Top 300 colleges by no. of applicationstop300_Apps = admission.data[order(admission.data$Apps, decreasing=T),][1:300,]top300_Apps[1:5,]xxxxxxxxxx# Top 300 colleges by application rate (Apps / Enroll)top300_apr = admission.data[order(admission.data$apr, decreasing=T),][1:300,]head(top300_apr,5)xxxxxxxxxx# Top 300 colleges by enrollment rate (Enroll / Accept)top300_enr = admission.data[order(admission.data$enr, decreasing=T),][1:300,]top300_enr[1:5,]xxxxxxxxxx# commoncommon = Reduce(intersect, list(rownames(top100_Top10perc), rownames(top100_Top25perc), rownames(top300_Apps), rownames(top300_enr), rownames(top300_apr)))length(common)xxxxxxxxxx###### Most Sought-after Colleges/Universities (Final list)xxxxxxxxxx# Most Sought-after Colleges / Universitiesmsal = top100_Top10perc[rownames(top100_Top10perc) %in% common, ]msal# Function to show customized percentilescustom_summary = function(df, r=NULL, p_reqd_n=NULL, p_lev_n=NULL) { p_reqd=c(0, 0.10, 0.25, 0.50, 0.75, 0.90, 1) p_lev=c('0%','10%','25%','50%','75%','90%', '100%') ndf = Filter(is.numeric, df) ndf_vars = colnames(ndf) p_reqd = c(p_reqd, p_reqd_n) p_reqdm = sort(p_reqd) p_lev = c(p_lev, p_lev_n) p_levm = p_lev[order(p_reqd)] p_levm = append(p_levm, 'Mean', which(p_levm == '50%')) res = data.frame(row.names=p_levm) for (var in ndf_vars) { x = ndf[,var] sm = data.frame("dStats" = quantile(x, p_reqdm)) res = cbind(res, i=sm[match(rownames(res),rownames(sm)), ]) names(res)[length(names(res))] = var res['Mean', var] = mean(x) } rownames(res)[which(rownames(res) == '0%')] = 'Min' rownames(res)[which(rownames(res) == '100%')] = 'Max' if (missing(r)) fres=res else fres=round(res,r) return(fres)}custom_summary(admission.data, 0, c(0.05,0.95), c('5%','95%'))xxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx##### c) Further analysis of Most sought-after colleges/univeristiesxxxxxxxxxxmsa = college[rownames(msal),]table(msa$Private)xxxxxxxxxxcustom_summary(college, r=0, c(0.05,0.95), c('5%','95%'))options(repr.matrix.max.cols=50)msaoptions(defop)xxxxxxxxxx<div class="alert alert-block alert-info"><a id='Observations_-_MSA'></a><b>Observations:</b><br> - 14 of the 16 (87.5%) MSAs (most sought-after institutions) are private.<br> - 15 (93.75%) have Grad.Rate > 90th percentile with 1 having ≈ 85th percentile with 83% grad rate.<br> - Outstate tuition for 13 (81.25%) MSAs is among the top 90th percentile.<br> </div>xxxxxxxxxx## Percentile tablesubset_df = msabase_df = collegevar_main = 'Top10perc'var = 'Grad.Rate'n = nrow(subset_df)k = 90# Percentilespercentiles_main = round(ecdf(base_df[,var_main])(subset_df[, var_main])*100, 2)percentiles = round(ecdf(base_df[,var])(subset_df[,var])*100,2)percentile_df = data.frame(rownames(subset_df), subset_df[,var], percentiles, percentiles_main)names(percentile_df) = c('College', var, 'Percentile', 'Percentile_Main')# Count below and above kth percentilecat(paste('<', k, ' :', sum(percentile_df$Percentile < k),';',sum(percentile_df$Percentile < k)/n*100))cat(paste('\n>=', k, ':', sum(percentile_df$Percentile >= k),'; ',sum(percentile_df$Percentile >= k)/n*100))cbind(S.N=1:nrow(percentile_df), percentile_df[order(percentile_df$Percentile, decreasing=T),])# Boxplots - Overall distribution v subset distributionplot_pars(12,4)boxplot(base_df[,var], horizontal=T, frame=F, boxwex=0.5, at = 0, col='cadetblue', main=var)boxplot(subset_df[,var], horizontal=T, frame=F, boxwex=0.3, at = 0.35, col='thistle',add=T)# Histograms - Overall and subsetplot_pars(12,8)par(mfrow=c(2,2))hist(base_df[,var], col='cadetblue', nclass='FD', main=paste(var, '\n(Overall)'), cex.axis=1.3)hist(subset_df[,var], col='thistle', nclass='FD', main=paste(var, '\n(Subset)'), cex.axis=1.3)# Scatterplots - Base variable v other variableplot(base_df[,var_main], base_df[,var], pch=19, bty="n", cex.main=1.4, cex.lab=1.2, cex.axis=1.3, main=paste(var_main,'v',var), xlab=var_main, ylab=var)plot(subset_df[,var_main], subset_df[,var], pch=19, bty="n", cex.main=1.4, cex.lab=1.2, cex.axis=1.3, main=paste(var_main,'v',var), xlab=var_main, ylab=var)par(defpar)options(defop)## Anderson-Darling test for comparing 2 sampleslibrary(kSamples)sig_lvl = 0.05ad.test(base_df[,var], subset_df[,var])## Kolmogorov-Smirnov testks.test(base_df[,var], subset_df[,var])xxxxxxxxxx##### '#####################################################################xxxxxxxxxx## Boxplotssubset_df = msa[sapply(msa, is.numeric)]base_df = collegeovars = names(subset_df)[names(subset_df) %in% names(base_df)]sig_lvl = 0.05plot_pars(12,3)for (var in ovars) { adt = ad.test(base_df[,var], subset_df[,var]) adp = paste('AD test:', round(adt$ad[[5]],5), ',', round(adt$ad[[6]],5)) ks = suppressWarnings(ks.test(base_df[,var], subset_df[,var])) ksp = paste('KS test:', round(ks$p.value, 5)) boxplot(base_df[,var], horizontal=T, frame=F, boxwex=0.5, at=0, col='cadetblue', main=var) abline(v=quantile(base_df[,var], c(0.05,0.1,0.9,0.95, 0.97426)), lty=c(3,4,4,3,5)) boxplot(subset_df[,var], horizontal=T, frame=F, boxwex=0.3, at=0.35, col='thistle', add=T) mtext(ksp, 3, adj = 0.05, line = 2, cex=1.1) mtext(adp, 3, adj = 0.05, line = 1, cex=1.1)}options(defop)xxxxxxxxxx##### '#####################################################################xxxxxxxxxx## Statistical test comparing variables and its subsettest_df = data.frame(matrix(NA, nrow=length(ovars), ncol=4))names(test_df) = c('KS','AD(v1)','AD(v2)', 'Significant')rownames(test_df) = ovarsfor (i in seq(length(ovars))) { var = ovars[i] # Kolmogrov-Smirnov test ks = suppressWarnings(ks.test(base_df[,var], subset_df[,var])) ksp = round(ks$p.value, 5) # Anderson-Darling test adt = ad.test(base_df[,var], subset_df[,var]) adp1 = round(adt$ad[[5]],5) adp2 = round(adt$ad[[6]],5) # add in df sig = ifelse((ksp<sig_lvl & adp1<sig_lvl & adp2<sig_lvl), 'Y', ifelse((ksp>sig_lvl & adp1>sig_lvl & adp2>sig_lvl), 'N', '-')) test_df[i, ] = c(ksp, adp1, adp2, sig)}test_dfxxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx## Top 20 colleges by no. of applicationstop20_Apps = college[order(college$Apps, decreasing=T),][1:20,]options(repr.matrix.max.cols=30)top20_Apps = cbind(top20_Apps, admission.data[match(rownames(top20_Apps), rownames(admission.data)), c('apr','acr','enr')])custom_summary(college, r=0, c(0.05,0.95), c('5%','95%'))top20_Appsxxxxxxxxxx<div class="alert alert-block alert-info"><a id='Observations_-_Top_Apps'></a><b>Observations:</b><br> - 18 of the top 20 Apps colleges are non-private.<br> - Institutions with high applications (HAIs) also have high acceptance.<br> - High applications also accompany high enrollment numbers but the enrollment rates distribution for HAIs is not very different from the overall enrollment rates distribution. <br> This suggests that although applications are high, not many go on to enroll. Many applications could be backup applications.<br> - HAIs have a statistically significant distribution (compared to the overall sample) for all variables except Outstate, Room.Board, Books, Personal, S.F. Ratio, perc.alumni.<br><br><i>Note: See workings below.</i></div>## Percentile tablesubset_df = top20_Appsbase_df = collegevar_main = 'Apps'var = 'Grad.Rate'n = nrow(subset_df)k = 90# Percentilespercentiles_main = round(ecdf(base_df[,var_main])(subset_df[, var_main])*100, 2)percentiles = round(ecdf(base_df[,var])(subset_df[,var])*100,2)percentile_df = data.frame(rownames(subset_df), subset_df[,var], percentiles, percentiles_main)names(percentile_df) = c('College', var, 'Percentile', 'Percentile_Main')# Count below and above kth percentilecat(paste('<', k, ' :', sum(percentile_df$Percentile < k),';',sum(percentile_df$Percentile < k)/n*100))cat(paste('\n>=', k, ':', sum(percentile_df$Percentile >= k),'; ',sum(percentile_df$Percentile >= k)/n*100))cbind(S.N=1:nrow(percentile_df), percentile_df[order(percentile_df$Percentile, decreasing=T),])# Boxplots - Overall distribution v subset distributionplot_pars(12,4)boxplot(base_df[,var], horizontal=T, frame=F, boxwex=0.5, at = 0, col='cadetblue', main=var)boxplot(subset_df[,var], horizontal=T, frame=F, boxwex=0.3, at = 0.35, col='thistle',add=T)# Histograms - Overall and subsetplot_pars(12,8)par(mfrow=c(2,2))hist(base_df[,var], col='cadetblue', nclass='FD', main=paste(var, '\n(Overall)'), cex.axis=1.3)hist(subset_df[,var], col='thistle', nclass='FD', main=paste(var, '\n(Subset)'), cex.axis=1.3)# Scatterplots - Base variable v other variableplot(base_df[,var_main], base_df[,var], pch=19, bty="n", cex.main=1.4, cex.lab=1.2, cex.axis=1.3, main=paste(var_main,'v',var), xlab=var_main, ylab=var)plot(subset_df[,var_main], subset_df[,var], pch=19, bty="n", cex.main=1.4, cex.lab=1.2, cex.axis=1.3, main=paste(var_main,'v',var), xlab=var_main, ylab=var)par(defpar)options(defop)## Anderson-Darling test for comparing 2 sampleslibrary(kSamples)sig_lvl = 0.05ad.test(base_df[,var], subset_df[,var])## Kolmogorov-Smirnov testks.test(base_df[,var], subset_df[,var])xxxxxxxxxx# Accept - No. of new Applicationsplot_pars(12,4)boxplot(college$Accept, horizontal=T, frame=F, boxwex=0.5, at = 0, col='cadetblue', main='Accept')boxplot(top20_Apps$Accept, horizontal=T, frame=F, boxwex=0.3, at = 0.35, col='thistle',add=T)options(defop)xxxxxxxxxx# Apps - No. of new Applicationsplot_pars(12,4)boxplot(college$Apps, horizontal=T, frame=F, boxwex=0.5, at = 0, col='cadetblue', main='Apps')boxplot(top20_Apps$Apps, horizontal=T, frame=F, boxwex=0.3, at = 0.35, col='thistle',add=T)options(defop)xxxxxxxxxx# Enroll - No. of new students enrolledplot_pars(12,4)boxplot(college$Enroll, horizontal=T, frame=F, boxwex=0.5, at = 0, col='cadetblue', main='Enrollments')boxplot(top20_Apps$Enroll, horizontal=T, frame=F, boxwex=0.3, at = 0.35, col='thistle',add=T)options(defop)xxxxxxxxxx# Enroll rate - No. of new students enrolled / No. of applications acceptedplot_pars(12,4)boxplot(admission.data$enr, horizontal=T, frame=F, boxwex=0.5, at = 0, col='cadetblue', main='enr')boxplot(top20_Apps$enr, horizontal=T, frame=F, boxwex=0.3, at = 0.35, col='thistle',add=T)options(defop)xxxxxxxxxx##### #'####################################################################### Boxplotssubset_df = top20_Apps[sapply(college, is.numeric)]base_df = collegeovars = names(subset_df)[names(subset_df) %in% names(base_df)]sig_lvl = 0.05plot_pars(12,3)for (var in ovars) { adt = ad.test(base_df[,var], subset_df[,var]) adp = paste('AD test:', round(adt$ad[[5]],5), ',', round(adt$ad[[6]],5)) ks = suppressWarnings(ks.test(base_df[,var], subset_df[,var])) ksp = paste('KS test:', round(ks$p.value, 5)) boxplot(base_df[,var], horizontal=T, frame=F, boxwex=0.5, at=0, col='cadetblue', main=var) boxplot(subset_df[,var], horizontal=T, frame=F, boxwex=0.3, at=0.35, col='thistle', add=T) mtext(ksp, 3, adj = 0.05, line = 2, cex=1) mtext(adp, 3, adj = 0.05, line = 1, cex=1)}options(defop)xxxxxxxxxx##### #'#####################################################################x
## Statistical test comparing variables and its subsettest_df = data.frame(matrix(NA, nrow=length(ovars), ncol=4))names(test_df) = c('KS','AD(v1)','AD(v2)', 'Significant')rownames(test_df) = ovarsfor (i in seq(length(ovars))) { var = ovars[i] # Kolmogrov-Smirnov test ks = suppressWarnings(ks.test(base_df[,var], subset_df[,var])) ksp = round(ks$p.value, 5) # Anderson-Darling test adt = ad.test(base_df[,var], subset_df[,var]) adp1 = round(adt$ad[[5]],5) adp2 = round(adt$ad[[6]],5) # add in df sig = ifelse((ksp<sig_lvl & adp1<sig_lvl & adp2<sig_lvl), 'Y', ifelse((ksp>sig_lvl & adp1>sig_lvl & adp2>sig_lvl), 'N', '-')) test_df[i, ] = c(ksp, adp1, adp2, sig)}test_dfxxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx# Elite colleges >> Top10perc > 50elite_df = college[which(college$Elite=='Yes'),]elite_df = elite_df[order(elite_df$Top10perc, elite_df$Top25perc, decreasing=T),]dim(elite_df)## Top 20 elite collegestop20_elite = elite_df[1:20,]custom_summary(college, 2, c(0.05,0.95), c('5%','95%'))options(repr.matrix.max.cols=30)top20_elitexxxxxxxxxx<div class="alert alert-block alert-info"><a id='Observations_-_Top_Elite'></a><b>'Elite'</b> : College/Universities that have >50% proportion of Top10perc students.<br><b>'Top10perc'</b> : % of new students from top 10% of their High School class<br><br><b>Observations:</b><br> - There are 78 (10% of 777) 'Elite' institutions.<br> - The distribution of every variable is different in 'Elite' colleges when compared with the variable's overall distribution, except in the case of 'Books' and 'Personal'.<br><br> <b>Top 20 Elite:</b><br> - Unsurprisingly, the top 'Elite' institutions also have the highest proportion of students that graduated in the top 25% of their high schools. All 20 are among the top 97th percentile of <b>Top25perc</b>.<br> - Min <b>'Phd'</b> and <b>'Terminal'</b> proportions are 91% and 92% respectively. <br> - 19 out of 20 institutions have faculty with <b>'PhD'</b>s within the top 90th percentile.<br> - 18 out of 20 institutions have faculty with <b>'Terminal'</b> degrees within the top 90th percentile.<br> - 16 of the 20 have <b>out-of-state tuition</b> among the top 90th percentile, with California-Irvine and California-Berkely being the notable outliers with 69th and 65th percentile respectively, and Georgia Institute of Technology being an extreme outlier with 15.7th percentile.<br> - 70% of the Top 20 Elite have <b>Room.Board</b> expenses among the top 85th percentile.<br> - <b>Student-faculty ratio</b> is generally lower than overall, with 14 of the top 20 having S.F.Ratio below the 25th percentile.<br>   Here again California-Irvine, California-Berkely and Georgetown Institute of Technology stand out with 73th, 71st and 91st percentiles respectively.<br> - 15 of the 20 are among the top 80th percentile in terms of proportion of alumni that donate (<b>perc.alumni</b>).<br> - <b>Graduation rates</b> are higher than the norm among the Elite institutions with 16 of the top 20 'Elite' having Grad.Rates above the 93rd percentile.</div>xxxxxxxxxx## Percentile table"Top 20 elite colleges"subset_df = top20_elitebase_df = collegevar_main = 'Top10perc'var = 'Grad.Rate'n = nrow(subset_df)k = 90# Percentilespercentiles_main = round(ecdf(base_df[,var_main])(subset_df[, var_main])*100, 2)percentiles = round(ecdf(base_df[,var])(subset_df[,var])*100,2)percentile_df = data.frame(rownames(subset_df), subset_df[,var], percentiles, percentiles_main)names(percentile_df) = c('College', var, 'Percentile', 'Percentile_Main')# Count below and above kth percentilecat(paste('<', k, ' :', sum(percentile_df$Percentile < k),';',sum(percentile_df$Percentile < k)/n*100))cat(paste('\n>=', k, ':', sum(percentile_df$Percentile >= k),'; ',sum(percentile_df$Percentile >= k)/n*100))cbind(S.N=1:nrow(percentile_df), percentile_df[order(percentile_df$Percentile, decreasing=T),])# Boxplots - Overall distribution v subset distributionplot_pars(15,4)boxplot(base_df[,var], horizontal=T, frame=F, boxwex=0.4, at = 0, col='cadetblue', main=var)abline(v=quantile(base_df[,var], c(0.05,0.1,0.9,0.95, 0.97426)), lty=c(3,4,4,3,5))boxplot(subset_df[,var], horizontal=T, frame=F, boxwex=0.3, at = 0.35, col='thistle',add=T)# Histograms - Overall and subsetplot_pars(12,8)par(mfrow=c(2,2))hist(base_df[,var], col='cadetblue', nclass='FD', main=paste(var, '\n(Overall)'), cex.axis=1.3)hist(subset_df[,var], col='thistle', nclass='FD', main=paste(var, '\n(Subset)'), cex.axis=1.3)# Scatterplots - Base variable v other variableplot(base_df[,var_main], base_df[,var], pch=19, bty="n", cex.main=1.4, cex.lab=1.2, cex.axis=1.3, main=paste(var_main,'v',var), xlab=var_main, ylab=var)plot(subset_df[,var_main], subset_df[,var], pch=19, bty="n", cex.main=1.4, cex.lab=1.2, cex.axis=1.3, main=paste(var_main,'v',var), xlab=var_main, ylab=var)par(defpar)options(defop)## Statistical test for comparing base and subset# Anderson-Darling test for comparing 2 sampleslibrary(kSamples)sig_lvl = 0.05ad.test(base_df[,var], subset_df[,var])# Kolmogorov-Smirnov testks.test(base_df[,var], subset_df[,var])xxxxxxxxxx##### #'#####################################################################xxxxxxxxxx## Boxplotssubset_df = top20_elite[sapply(top20_elite, is.numeric)]base_df = collegeovars = names(subset_df)[names(subset_df) %in% names(base_df)]library(kSamples) # for 2 sample Anderson-Darling testsig_lvl = 0.05plot_pars(12,3)for (var in ovars) { adt = ad.test(base_df[,var], subset_df[,var]) adp = paste('AD test:', round(adt$ad[[5]],5), ',', round(adt$ad[[6]],5)) ks = suppressWarnings(ks.test(base_df[,var], subset_df[,var])) ksp = paste('KS test:', round(ks$p.value, 5)) boxplot(base_df[,var], horizontal=T, frame=F, boxwex=0.5, at=0, col='cadetblue', main=var) abline(v=quantile(base_df[,var], c(0.05,0.1,0.9,0.95)), lty=c(3,4,4,3,5)) boxplot(subset_df[,var], horizontal=T, frame=F, boxwex=0.3, at=0.35, col='thistle', add=T) mtext(ksp, 3, adj = 0.05, line = 2, cex=1.1) mtext(adp, 3, adj = 0.05, line = 1, cex=1.1)}options(defop)xxxxxxxxxx##### #'#####################################################################xxxxxxxxxx## Statistical test comparing variables and its subsettest_df = data.frame(matrix(NA, nrow=length(ovars), ncol=4))names(test_df) = c('KS','AD(v1)','AD(v2)', 'Significant')rownames(test_df) = ovarsfor (i in seq(length(ovars))) { var = ovars[i] # Kolmogrov-Smirnov test ks = suppressWarnings(ks.test(base_df[,var], subset_df[,var])) ksp = round(ks$p.value, 5) # Anderson-Darling test adt = ad.test(base_df[,var], subset_df[,var]) adp1 = round(adt$ad[[5]],5) adp2 = round(adt$ad[[6]],5) # add in df sig = ifelse((ksp<sig_lvl & adp1<sig_lvl & adp2<sig_lvl), 'Y', ifelse((ksp>sig_lvl & adp1>sig_lvl & adp2>sig_lvl), 'N', '-')) test_df[i, ] = c(ksp, adp1, adp2, sig)}test_dfxxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx### Code help sources[List of sources for the book](https://github.com/rahul-ahuja1/Solutions_-_An_Introduction_to_Statistical_Learning#solutions---an-introduction-to-statistical-learning)xxxxxxxxxx## Short description of variablesStatistics for a large number of US Colleges from the 1995 issue of US News and World Report. [Return to Index](#Index)• <b>Private</b> : Public/private indicator • <b>Apps</b> : Number of applications received • <b>Accept</b> : Number of applicants accepted • <b>Enroll</b> : Number of new students enrolled • <b>Top10perc</b> : New students from top 10 % of high school class • <b>Top25perc</b> : New students from top 25 % of high school class • <b>F.Undergrad</b> : Number of full-time undergraduates • <b>P.Undergrad</b> : Number of part-time undergraduates • <b>Outstate</b> : Out-of-state tuition • <b>Room.Board</b> : Room and board costs • <b>Books</b> : Estimated book costs • <b>Personal</b> : Estimated personal spending • <b>PhD</b> : Percent of faculty with Ph.D.’s • <b>Terminal</b> : Percent of faculty with terminal degree • <b>S.F.Ratio</b> : Student/faculty ratio • <b>perc.alumni</b> : Percent of alumni who donate • <b>Expend</b> : Instructional expenditure per student • <b>Grad.Rate</b> : Graduation rate Statistics for a large number of US Colleges from the 1995 issue of US News and World Report.
Return to Index
• Private : Public/private indicator
• Apps : Number of applications received
• Accept : Number of applicants accepted
• Enroll : Number of new students enrolled
• Top10perc : New students from top 10 % of high school class
• Top25perc : New students from top 25 % of high school class
• F.Undergrad : Number of full-time undergraduates
• P.Undergrad : Number of part-time undergraduates
• Outstate : Out-of-state tuition
• Room.Board : Room and board costs
• Books : Estimated book costs
• Personal : Estimated personal spending
• PhD : Percent of faculty with Ph.D.’s
• Terminal : Percent of faculty with terminal degree
• S.F.Ratio : Student/faculty ratio
• perc.alumni : Percent of alumni who donate
• Expend : Instructional expenditure per student
• Grad.Rate : Graduation rate